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Abstract. Although prospective logistic regression is the standard 
method of analysis for case-control data, it has been recently noted that 
in genetic epidemiologic studies one can use the "retrospective" likeli- 
hood to gain major power by incorporating various population genet- 
ics model assumptions such as Hardy- Weinberg-Equilibrium (HWE), 
gene-gene and gene-environment independence. In this article we re- 
view these modern methods and contrast them with the more classical 
approaches through two types of applications (i) association tests for 
typed and untyped single nucleotide polymorphisms (SNPs) and (ii) es- 
timation of haplotype effects and haplotype-environment interactions 
in the presence of haplotype-phase ambiguity. We provide novel in- 
sights to existing methods by construction of various score-tests and 
pseudo-likelihoods. In addition, we describe a novel two-stage method 
for analysis of untyped SNPs that can use any flexible external algo- 
rithm for genotype imputation followed by a powerful association test 
based on the retrospective likelihood. We illustrate applications of the 
methods using simulated and real data. 

Key words and phrases: Case-control studies, Empirical-Bayes, ge- 
netic epidemiology, haplotypes, model averaging, model robustness, 
model selection, retrospective studies, shrinkage. 



1. INTRODUCTION 

Case-control study designs are now widely used to 
study the role of genetic susceptibility in the etiology 
of rare complex diseases. Typically, a case-control 
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study involves recruiting all or a large fraction of the 
diseased subjects (cases) that arise in an underlying 
study base and then sampling a comparable num- 
ber of healthy subjects (controls), ideally from the 
exact same study base, and possibly matched with 
the cases by some socio-demographic characteristics 
such as race, age and gender. Biological samples and 
questionnaire data collected on the sampled subjects 
are then used to determine their genetic suscepti- 
bility, such as SNP genotypes and history of some 
nongenetic (environmental) exposures. For rare dis- 
eases such as cancers, case-control studies are cost- 
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efficient compared to a cross-sectional or prospective 
cohort studies because they dramatically reduce the 
number of nondiseased subjects to study. 

In general, the standard method for analysis of 
case-control data is the prospective logistic regres- 
sion ignoring the retrospective nature of the under- 
lying design. The validity of this approach relies on 
the classic results by Cornfield (1956) who showed 
the equivalence of prospective- and retrospective- 
odds ratios. The efficiency of the approach was es- 
tablished in two other classic papers by Andersen 
(1970) and Prentice and Pyke (1979), who showed 
that the prospective analysis of case-control data 
yields the proper maximum-likelihood estimates of 
the odds ratio parameters of the logistic model un- 
der a "semiparametric" setup that allows the distri- 
bution of the underlying covariates to remain com- 
pletely unrestricted. More recently, it has been shown 
that even in the presence of missing data and mea- 
surement error in covariates, the "prospective" treat- 
ment of case-control data can yield proper maximum- 
likelihood estimates as long as the distribution of 
the underlying covariates is allowed to remain unre- 
stricted (Roeder, Carroll and Lindsay, 1996). 

A special feature for studies in genetic epidemiol- 
ogy is that it is often reasonable to assume certain 
models for the population distribution of the ge- 
netic and environmental covariates of interest. The 
Hardy-Weinberg-Equilibrium (HWE) law, for ex- 
ample, which implies a simple relationship between 
allele and genotype frequencies at a given chromoso- 
mal locus, is a natural model for a random mating, 
large, stable population in the absence of new ge- 
netic mutations, inbreeding and selective survivor- 
ship among genotypes (see Hartl and Clark (2007), 
Chapter 3). Genes which are physically apart and 
hence are not expected to be in linkage disequilib- 
rium (LD) are also expected to be independently 
distributed in a homogeneous population. It is often 
also natural to assume a subject's genetic suscepti- 
bility, a factor which is determined at birth, is inde- 
pendent of his/her subsequent environmental expo- 
sures. A pertinent question then is what is the most 
appropriate method for analysis of case-control data 
in genetic epidemiology where some natural model 
assumptions exist for the distribution of genetic and 
environmental factors in the underlying population. 

We will assume data on some genetic (67) and en- 
vironmental (E) exposures are collected in a case- 
control study involving Nq controls (D = 0) and N\ 
cases (D = 1). If one ignores the retrospective nature 



of the case-control design, one can conduct inference 
based on the prospective-likelihood 

JV 

(1) L p = ]Jpr(A|G 4 ,^), 

i=l 

where N = N\ + Ao- The fundamental likelihood for 
case-control data, however, known as the "retrospec- 
tive" likelihood, is given by 

N 

(2) L R = J]pr(G i ,^|A). 

i=i 

In the absence of any missing data, it is evident 
from the classical theory that the prospective- likelihood 
(1) provides a valid way of testing and estimation of 
the odds ratio association parameters of the under- 
lying logistic regression model. In fact, the prospective- 
likelihood yields the same maximum-likelihood esti- 
mates for the odds ratio association parameters that 
could be obtained by maximization of the proper 
retrospective likelihood (2) while allowing pr(G,E), 
the joint distribution of G and E, to remain com- 
pletely non-parametric. Under constraints on pr(G7, E), 
however, the retrospective likelihood would not yield 
the same maximum-likelihood estimator as that from 
the prospective likelihood. More importantly, the 
retrospective-likelihood can exploit various popula- 
tion genetics model assumptions such as HWE, gene- 
gene and gene-environment independence to gain 
major efficiency over the prospective-likelihood for 
inference on various association and interaction pa- 
rameters. At the same time, if the underlying model 
assumptions are violated, then the use of the retro- 
spective likelihood can lead to serious bias for both 
testing and estimation procedures. In the presence 
of missing data, a further complication is that the 
use of the prospective likelihood may not be even 
strictly valid in certain settings, such as that de- 
scribed in Section 4 for estimation of haplotype ef- 
fects, where for the purpose of identifiability L p also 
requires some modeling assumptions, thus destroy- 
ing its equivalence with L R that is known to hold 
under unspecified covariate distribution. Thus, to 
date, a debate remains about the most appropriate 
method of analysis of case-control studies in genetic 
epidemiology. 

In this article we will review some modern devel- 
opments for analysis of case-control studies in ge- 
netic epidemiology using the prospective- and 
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retrospective-likelihoods. We will describe the meth- 
ods primarily through two different types of applica- 
tions: (a) association testing for genotyped and im- 
puted single nucleotide polymorphisms (SNP) (Sec- 
tions 2 and 3) and (b) estimation of haplotype effects 
and haplotype-environment interactions in the pres- 
ence of phase ambiguity (Section 4). In each section 
we aim to provide new intuitive insights into the al- 
ternative methods by constructions of various score 
tests (Sections 2 and 3) and pseudo-likelihoods (Sec- 
tion 4). As a byproduct, in Section 3 we also propose 
a novel "retrospective" method for association test- 
ing for untyped SNPs which can easily use any exter- 
nal algorithm for imputation of genotypes. In each 
section we will use numerical examples to illustrate 
the bias and efficiency trade-off between the alter- 
native methods. We will conclude the article with a 
discussion and recommendations for practical data 
analysis. 

2. ASSOCIATION ANALYSIS FOR SINGLE 
NUCLEOTIDE POLYMORPHISMS (SNPS) 

2.1 The Prospective Approach 

The genotype information for an individual SNP 
in a case-control study can be represented by the 
2x3 contingency table defined by cross-tabulation 
of case-control and genotype status. Let D be the 
indicator of case (D = 1) or control (D = 0) status 
and let G be the number of minor alleles carried 
by an individual (G = 0,l,2). Let n^g denote the 
number of subjects with genotype G = g and disease 
status D = d observed in the case-control sample. 
Suppose we are interested in testing the association 
of the disease outcome with a SNP-genotype using 
a population logistic regression model of the form 

(3) P^ D - 1 \G)- 1 + eMa + p T m{G) y 

where the function m(-) is chosen in a suitable way 
to reflect an assumed mode of genetic effect. If, for 
example, G denotes the count for the minor allele at 
a SNP locus, then one can chose m(G) = G, m{G) = 
I(G > 1) or m(G) = I(G = 2) to model the effect of 
the minor allele as additive (in the logistic scale), 
dominant or recessive. One can also consider a sat- 
urated model by allowing m(G) to be a vector of 
two dummy variables associated with heterozygous 
(G = 1) and homozygous variant (G = 2) genotypes 
and (3 to be the corresponding log-odds-ratios. The 
prospective analysis of case-control data yields an 



asymptotically unbiased estimate for the genotype- 
odds-ratio parameters (3, but not for the intercept 
parameter a. 

The score function for /3 under the prospective- 
likelihood (1) can be written as 



U PL 



m(Gi){Di 



pr(Z> = l|G<)}. 



8=1 



Under the null hypothesis, = 0, we can estimate 
p = pr(D = l\Gi) as p = N±/(Ni + No) since under 
that hypothesis, G does not influence D. Then the 
score function can be written as 



17? 



PL 



iVi 

^m(Gi 
i=l 



Ni+N 

™{Gi 

i=l 



1^ 



m(Gi 



1 

An 



i=JVi+l 



which is proportional to the difference between the 
empirical means of m(G) in the cases (D = 1) and 
in the controls (D = 0). We suppose without loss 
of generality that the indices for the cases are {i = 
1, . . . ,N±} and those for the controls are {i = N\ + 
1, . . . , N± + No}. If, for example, we assume m(G) = 
G, that is, the additive effect, then Up L corresponds 
to the numerator of the Cochran-Armitage trend 
test (van Belle et al., 2004, Chapter 7) that is widely 
used for single-SNP association testing. More gen- 
erally, a "prospective" score-test can be constructed 
under any genetic model based on U PL and its vari- 
ance under the null hypothesis of no association that 
be estimated by 



V? 



PL 



m(G) j 



where V m iQ\ is the pooled-sample variance of m(Gi). 

2.2 Retrospective Approach 

The retrospective likelihood, L R , for the genotype 
data of a single-SNP can be written as the product 
of two sets of multinomial probabilities: 



L = Li x Lr 



n^ 9 >< 

9=0 



IK 



n 0g 



9=0 



where pd g = pr(G = g\D = d), d = and 1, denotes 
the population genotype frequencies for the con- 
trols and the cases, respectively. Given the genotype 
probabilities for the controls, we can characterize 
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the genotype probabilities for the cases according 
to the formula (Satten and Kupper, 1993) 



(4) 



Pig 



T, 2 g=o1>g(P)POg 



where ip g {fi) denotes the odds ratio associated with 
the genotype G = g as specified by the logistic model 
(3). Thus, the retrospective likelihood can be pa- 
rameterized in terms of the genotype probabilities of 
the controls and the disease-odds-ratio parameters 
f3. The maximization of the retrospective likelihood 
L R , without imposing any further constraints on the 
genotype probabilities for the controls, will lead to 
the same estimator for j3 that would be obtained 
by maximization of L p (Prentice and Pyke, 1979). 
In fact, it can be shown that the retrospective- and 
prospective-profile likelihoods of /3 become identical 
after maximization of the corresponding likelihoods 
with respect to the associated nuisance parameters 
(Roeder, Carroll and Lindsay, 1996). Thus, the as- 
sociated tests, including score-, Wald- and likelihood- 
ratio tests, are identical under the retrospective and 
prospective likelihoods. 

Now suppose we are willing to assume that HWE 
holds in the underlying population and that the dis- 
ease is rare so that HWE also holds approximately 
in the control population. In the retrospective like- 
lihood L R , we can write the genotype probabilities 
for the controls function of the frequency, /, of 
the minor allele as 

Poo(/) = (l-/) 2 , Poi(/) = 2/(l-/), 

It is easy to show that the score function for /3 associ- 
ated with the retrospective likelihood can be written 

as 

iVi 

Url = X)KGi) - E UWEJ {m(G)\D = 1}], 

i=l 

which under the null hypothesis of no association 
reduces to 

(5) U° RL = ^2[m(G t ) - E UWEJ {m(G)}]. 
t=l 

Moreover, under the null hypothesis, the allele fre- 
quency / can be substituted for by its maximum- 
likelihood estimate 

n+i + 2n + 2 



where n +g denotes the frequency for genotype G = g 
in the pooled sample of cases and controls. Thus, 
Url corresponds to the difference between the em- 
pirical mean of the function m(G) in cases and its 
expected value under HWE and the null hypoth- 
esis of no association. In contrast, note that U PL 
corresponds to the difference between the empirical 
mean of the function m(G) in cases and the empir- 
ical mean for the same function in the controls. If 
the expectation in the retrospective score function 
(5) is estimated empirically without assuming HWE, 
then, as expected, it can be easily shown that the 
retrospective and prospective scores are the same. 
If, however, we assume HWE to evaluate the retro- 
spective score function, then it would have smaller 
variance than that for the prospective score. In par- 
ticular, this can be seen from the estimate of the 
variance estimate U RL given by 

Vbl = Ni\v m{G) - - /)C(/)C(/) T |, 



where 



C(f) = covhwejS m(G), 



9-V 
/(I-/) 



G-2f 
7(W) 

POg(f)- 



(6) 



/ 



2N 



By the Cauchy-Schwarz inequality, V RL > V RL asymp- 
totically, implying that the retrospective score test 
is asymptotically more powerful than its prospective 
counterpart when the assumption of HWE is valid. 

Chen and Chatterjee (2007) compared the perfor- 
mance of 2 d.f. Wald-tests of association based on 
the retrospective and prospective likelihoods. They 
observed major gains in power for the test based 
on the retrospective-likelihood for the detection of 
nonmultiplicative effects, for example, recessive ef- 
fects. Notice that if we assume an additive model, 
that is, m(G) = G, then the prospective and retro- 
spective score functions U RL and U PL become iden- 
tical because in this case E KWE j{m(G)} = 2f = 

YliLi G-i/N. The larger the departure of the effect of 
a SNP from the additive form, the greater the gain 
in efficiency for the retrospective method. Applica- 
tion of retrospective methods for association testing, 
however, requires caution because of their sensitivity 
to the underlying model assumption. In particular, 
it can be seen from the formula of U RL that the un- 
biasedness of that score function crucially depends 
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on the assumption of HWE being correct for the un- 
derlying population. Satten and Epstein (2004) and 
Chen and Chatterjee (2007) have noted that even 
modest violation of HWE can cause serious infla- 
tion in Type-I error in association tests based on 
the retrospective likelihood. 

2.3 Empirical-Bayes Methods 

Luo et al. (2009) considered an empirical-Bayes 
type shrinkage estimation approach to develop a 2 
d.f. single-SNP association test that can gain power 
by exploiting the model assumptions of HWE for 
the underlying population and yet is resistant to 
bias when the model assumptions are violated. The 
method involves estimation of genotype-specific dis- 
ease odds ratio parameters by data-adaptive "shrink- 
age" of a "prospective" model-free estimator that 
does not require the HWE assumption toward a 
"retrospective" model-based estimator that directly 
exploits the HWE constraints. The amount of 
"shrinkage" is sample-size and data-adaptive, so that 
in large samples the method has no bias whether 
the assumption of HWE holds or not and yet the 
method can gain efficiency by shrinking the analysis 
toward HWE, but only to the extent that the data 
validate the assumptions. In what follows we pro- 
vide some insight into the empirical-Bayes method 
through the construction of a score-test. For numer- 
ical illustration, however, we will focus on the Wald 
test as originally developed in Luo et al. (2009). 

Let fn(G) = (JVi + No)- 1 ^^ ™^) 
and 4 {G) = (m + No)' 1 E^imiG^-miG)} 2 
denote the sample mean and variance for the func- 
tion m(G), respectively. Further, let r = m{G) — 
E^ HWE m(G) denote the difference between the em- 
pirical and expected means of m(G) when the lat- 
ter quantity is computed assuming HWE and under 
the estimate of allele frequency / given in (6). Intu- 
itively, t can be viewed as an estimate of the bias in 
estimation of the population mean of m(G) under 
the assumption of HWE. An empirical-Bayes type 
score function can be now defined as 

(7) U° EB =J2[m(G i )-E EB {m(G)}}, 

i=i 

where EEB{ m (G)} is the empirical-Bayes estimate 
for the mean of the function m(G) under Hq, given 
by 



+ 



■m[G). 



E EB {m(G)} 



5 m(G) 



/N 



4 (G) Av + - 2 HWEJ 



Thus, EEBimiG)} corresponds to a weighted aver- 
age of the empirical mean of m(G) and its expected 
mean under HWE, with the weights defined by an 
estimate of the bias for the estimate of the popula- 
tion mean of m(G) under HWE and an estimate of 
the variance of the empirical mean of m(G). As r 2 
decreases, that is, the evidence of bias due to the vio- 
lation of HWE becomes smaller, EEB{ m (G)} gives 
more weight to the more precise HWE-based esti- 
mator of the population mean of m{G). Conversely, 
as s^L^/N decreases, that is, the sample mean of 
m(G) becomes more precise, then £ , ^{m(G)} puts 
more weight to the robust model-free estimator fn(G). 
The original perspective for constructing such 
weighted combinations of model-based and model 
free estimators from an empirical-Bayes point of view 
can be found in Mukherjee and Chatterjee (2008). 
Simple methods for variance estimation for such es- 
timators have been also described in that article. 

2.4 The Cancer Genetics Markers of 
Susceptibility (CGEMS) Study 

We illustrate the performance of alternative 2 d.f. 
single SNP association tests using data from the 
Cancer Genetics Markers of Susceptibility (CGEMS) 
study (Yeager et al., 2007; Hunter et al., 2007; 
Thomas et al., 2008), an NCI enterprize initiative to 
conduct multistage whole-genome association stud- 
ies to identify susceptibility genes giving rise to in- 
creased risks of prostate and breast cancers. In this 
article we will focus on data from the initial scan 
for the prostate cancer study, involving genotype 
data on about 550,000 SNPs from 1172 cases and 
1157 controls. The details of the CGEMS study de- 
sign and the results from the initial scan and sub- 
sequent replication studies can be found at the web 
site https : / / caintegrator.nci.nih.gov/cgems/. 

We consider 449,698 SNPs from 22 nonsex chro- 
mosomes with minor allele frequencies larger than 
0.05. Table 1 displays the empirical proportions of 
the number of SNPs that are found to be significant 
at different nominal significance levels using 2 d.f. 
tests based on three different methods: (a) prospec- 
tive, (b) retrospective and (c) empirical-Bayes [see 
Luo et al. (2009) for more details]. For a well-designed 
study and a robust analytic method, the empirical 
proportions are expected to be fairly close to the 
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Table 1 

The empirical proportions of significant SNPs detected by 
different methods at different nominal significance levels in 
the CGEMS prostate cancer study 



a 


Prospective 


Retrospective 


Empirical-Bayes 


5e-2 


5.01e-2 


5.66e-2 


4.49e-2 


le-2 


0.98e-2 


1.43e-2 


0.87e-2 


le-3 


1.05e-3 


3.85e-3 


1.00e-3 


le-4 


1.27e-4 


2.24e-3 


1.3 le-4 


le-5 


2.67e-5 


1.76e-3 


3.34e-5 


le-6 


2.22e-6 


1.47e-3 


4.45e-6 



nominal significant levels, given that the vast ma- 
jority of the SNPs are likely to be not associated 
with the disease. In Table 1, we observe that the em- 
pirical proportions of significant SNPs found by the 
prospective method closely follows the nominal sig- 
nificance levels. In contrast, the corresponding pro- 
portions for the retrospective test deviate severely 
from the nominal values in the range of a < 10~ 3 , 
indicating significantly inflated type-I error due to 
the violation of HWE for many SNPs. The last col- 
umn of Table 1 shows that the empirical-Bayes pro- 
cedure essentially corrects for all the bias of the ret- 
rospective method due to the violation of the HWE 
assumption. 

Next, we conducted a simulation study to inves- 
tigate the performance of various tests in ranking a 
true susceptibility locus in a genome-wide associa- 
tion study (GWAS) that include hundreds of thou- 
sands of "null" SNPs. To generate realistic link- 
age disequilibrium patterns, we simulated GWAS 
data mimicking the CGEMS study itself. Given mi- 
nor allele frequency among controls and the disease- 
genotype odds ratio parameters for a chosen suscep- 
tibility locus, we simulate genotype data at that lo- 
cus for the cases and controls separately from the 
corresponding multinomial distributions. Given the 
genotype data at the susceptibility locus for a case 
or a control, we simulate genotype data for the re- 
mainder of the SNPs by assigning the whole geno- 
type profile for a randomly selected subject from 
the controls of the CGEMS study who have the 
same genotype data at the given susceptibility lo- 
cus as the sampled subject in our simulation study. 
This algorithm, as originally described by Yu et al. 
(2009), assumes that given the genotypes for the 
susceptibility locus, the risk of the disease is inde- 
pendent of all the remaining SNPs. We simulated 
50 data sets with approximately 550 cases and 550 



controls. For each data set we tested for association 
for each of the approximately 450,000 SNPs using 
the prospective, retrospective and empirical-Bayes 
methods. The rank of the disease- associated SNP 
is obtained by sorting all the p-values in ascending 
order. 

Table 2 displays the median ranks obtained by 
three methods for a true disease-associated SNP that 
has a recessive effect with a log-odds-ratio of /3 = 
log(3). As expected, the ranks of all tests decrease 
as the minor allele frequency increases. Comparing 
the ranks of different tests at a specific minor allele 
frequency, we can see that the standard prospective 
method generally has the lowest power in the sense 
that it assigns much higher rank to the susceptibil- 
ity SNP than the two other tests. When minor allele 
frequency is 0.1, we observe that the pure retrospec- 
tive method performs the best in the sense that it 
assigns the lowest rank to the susceptibility SNPs 
among all the methods. In contrast, when minor al- 
lele frequency is greater than or equal to 0.2, we 
observe that the empirical-Bayes procedure assigns 
considerable lower rank to the susceptibility SNP 
than the pure retrospective method. Intuitively, the 
results can be explained from the fact that the ret- 
rospective method yields low p- values for many null 
SNPs due to the violation of the HWE assumption 
(see Table 1) and thus dilutes the rank of the real 
susceptibility SNP. 

3. ASSOCIATION ANALYSIS FOR IMPUTED 
SNPS 

The forms of the prospective- and retrospective- 
scores suggest how they can be modified easily for 
SNPs that may not have been directly genotyped, 
but can be "imputed" conditional on neighboring 
SNPs and estimates of linkage disequilibrium from 

Table 2 

Simulated median ranks of a true susceptibility SNP with a 
recessive effect and log-odds-ratio value of log(3) for 
alternative tests. The results are based on 50 simulated 
datasets, each of which has approximately 550 cases and 550 
controls and 450,000 SNPs. MAF: minor allele frequency 



MAF 


Prospective 


Retrospective 


Empirical-Bayes 


0.1 


112163 


8117 


44319 


0.2 


1888 


203 


52 


0.3 


656 


210 


27 


0.4 


15 


82 
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HapMap or other similar databases. Let N(G) de- 
note the neighboring genotype information for an 
untyped SNP-locus with unobserved genotype G. 
The prospective score for such an untyped SNP can 
be defined by taking the conditional expectation of 
the "complete data" score function U PL given the 
observed data, that is, the neighboring genotype in- 
formation. More formally, the prospective score for 
an untyped SNP can be written as 



ttOu 



NtN 



(8) 



1 Nl 

— 5>{m(G)|JV(G0} 



i=l 
N 



Jr£s{m(G)[^(G,)} 



No' 



i=l 



where the conditional expectations are taken with 
respect to a suitable imputation model such as those 
described by Nicolae (2006), Marchini et al. (2007) 
and others. The retrospective score for an untyped 
SNP can be similarly defined by the conditional ex- 
pectation of the "complete data" retrospective score 
function U RL given the observed data M(G) in the 
form 



Ni 



U° R l = Y,[E{m(G)\N(G t )} 



(9) 



i=i 



E UWEJ {m(G)}}. 



Notice that in the retrospective score function, the 
contribution of the term -EhwejI^G)} is a con- 
stant term given the allele frequency /. The estima- 
tion of the allele frequency / for an untyped SNP, 
however, requires imputation. In particular, under 
the "complete data" model we can write the esti- 
mate of the allele frequency under the null hypoth- 
esis of no association as 

E Wi {/(Gi = 1) + 2/(G . = 2)} 



/ 



2N 



Thus, given an imputation model, we can estimate 
the allele frequency / as 



/N +N! 

f u = V pr{G = l|AA(G)} 



(10) 



\ i=i 



+ 2pi{G = 2\M(G i )}\ (2N). 



We further need the variances for Up\ and Up^ 
under the null hypothesis to obtain the correspond- 



ing score tests. The variance of Up\ can be esti- 
mated as 

NxN 



V PL 



Ni+Nq 



■V, 



E{m{G)\N(G)}, 



where VE{ m (G)\M{G)} 1S the pooled-sample variance 
of E{m(G)\ftf(Gi)}. The prospective-score test is 
based on the test statistic given by 

where the superscripts T and — denote transpose and 
generalized inverse, respectively. Asymptotically, this 
statistic follows a chi-squared distribution under the 
null hypothesis of f3 = 0, with the degrees of freedom 
given by the dimension of m{G). The variance of the 
retrospective score Up L , after adjusting for the esti- 
mation of the allele frequency / by / given by (10), 
can be estimated by 



V, 



0u 
RL 



Ni 



VE{m{G)\M(G)} 

+ §{^™c(/)c(/) T 



QC(f) T -C(f)Q'> 



where Q is the pooled-sample covariance between 
E{m(G)\N(Gi)} and E{G\N(Gi)}. The variance of 
Up}^ can also be alternatively estimated by the ro- 
bust sandwich-type estimate given as 



T/ 0u 



Ni+N 

i=i 



RL. 



RLA) 



where the efficient score 

&RL,i = Di[E{m(G)MGi)} - E RWEj {m(G)}] 

-§fC(f)[E{G\N(G l )}-2f]. 

The retrospective-score test is then based on the test 
statistic given by 

(u° R l) T {v°t}-u° R i, 

which again follows a chi-squared distribution asymp- 
totically under the null hypothesis, with the degrees 
of freedom given by the dimension of m(G). In both 
the prospective- and retrospective-score tests given 
above, we obtain the conditional probability 
Pr{G\Af(Gi)} directly from some external reference 
database, for example, HapMap, a strategy similar 
to the proposal of Nicolae (2006). 
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We now demonstrate the potential power advan- 
tages that might be achieved by imputing the un- 
typed SNP, using numerical studies following two 
scenarios as in Tables 1 and 2 of Nicolae (2006). In 
Scenario 1 the untyped SNP can be perfectly pre- 
dicted by the genotypes of the typed SNPs, namely, 
the Rg = 1 (see Stram et al., 2004, for a defini- 
tion), while in Scenario 2 the untyped SNP is moder- 
ately predicted by the genotypes of the typed SNPs 
with R 2 S = 0.39. The SNP profiles together with the 
haplotype frequencies estimated from HapMap CEU 
samples in the two scenarios are summarized in Ta- 
bles 3 and 4. Also listed in Tables 3 and 4 are the 
haplotype frequencies we actually used to simulate 
the SNP data for the case-control sample, which 
moderately deviate from those seen in the HapMap 
CEU sample to reflect the potential discrepancy be- 
tween the HapMap and study samples. The haplo- 
type pair for each person is generated according to 
HWE. 

We simulated the case-control status by the lo- 
gistic regression model (3), where the genetic deter- 
minant G is given by the minor allele count of the 
untyped SNP, and the function m(-) is given by the 
recessive, dominant or additive genetic mode. The 
intercept a = —3.0, which yields an overall disease 
rate around 5%. Each analysis is based on a case- 
control sample with 1000 cases and 1000 controls. 
The simulation results are based on 1000 (3000) rep- 
etitions for evaluation of test power (size). All the 
tests are performed at a significance level of 0.01. 
The score tests are performed using the correct ge- 
netic model, and the retrospective-score tests are 
based on the robust sandwich-type variance esti- 
mates; results based on model-based variance es- 

Table 3 

The SNP profiles and haplotype frequencies for the region 
considered in Scenario 1, where the untyped SNP can be 
perfectly predicted by genotyped SNPs Ai, ... ,^4 (R 2 S = 1). 
Also listed are the haplotype frequencies estimated from the 
CEU sample in HapMap. Part of the data are from Table 1 
of Nicolae (2006) 



Haplotype of SNPs 

A1-T-A2-A3-A4 


Frequency 


Frequency 
from HapMap 


1-0-0-0-0 


0.158 


0.058 


0-1-0-1-0 


0.400 


0.300 


1-1-0-1-0 


0.050 


0.050 


1-1-1-0-1 


0.358 


0.558 


0-1-1-0-1 


0.022 


0.017 


1-1-0-0-1 


0.012 


0.017 



Table 4 

The SNP profiles and haplotype frequencies for the region 
considered in Scenario 2, where the untyped SNP is 
moderately predicted by genotyped SNPs Ai,.. . ,As 
(R 2 S =0.39). Also listed are the haplotype frequencies 
estimated from the CEU sample in HapMap. Part of the 
data are from Table 2 of Nicolae (2006) 



Haplotype 

A ± -T-A 2 


of SNPs 


Frequency 


Frequency 
from HapMap 


0-0-0-0 




0.088 


0.058 


0-0-1-1 




0.027 


0.017 


0-1-0-0 




0.302 


0.342 


0-1-1-0 




0.008 


0.008 


1-0-1-0 




0.242 


0.142 


1-0-1-1 




0.333 


0.433 



timates are quite similar and are omitted. When 
performing the prospective- and retrospective-score 
tests with imputed genotypes for the untyped SNP, 
we use the haplotype frequency estimates from 
HapMap to obtain the conditional probabilities 
Pr{G\Af(Gi)}, even though the case-control sample 
is actually from a population with moderately dif- 
ferent haplotype frequencies. To see the degree of 
recovery of missing information achieved by imputa- 
tion, we also perform the prospective- and 
retrospective-score tests based on the true genotypes 
at the untyped SNP. In addition, we perform the 
multimarker Hotelling's T 2 test based on genotypes 
at typed SNPs (Xiong, Zhao and Berwinkle, 2002; 
Chapman et al., 2003), which is equivalent to the 
prospective-score test derived from the logistic re- 
gression model (3) with the covariates m{G) given 
as the vector of genotypes for all the typed SNPs. 

Results for this simulation study are presented 
in Tables 5 (Scenario 1) and 6 (Scenario 2). It is 
seen that the score tests with imputed genotypes 
have size matching reasonably well with the nom- 
inal value of 1%, even though the imputation is 
based on haplotype frequencies that are obtained 
from the HapMap data and are different from the 
true frequencies. From the results regarding power, 
we see that imputing the untyped SNP in either 
the prospective- or the retrospective-score test can 
achieve substantial power gains as compared with 
the Hotelling's T 2 test based only on genotyped 
SNPs. The relative power improvement gained by 
imputation can still be quite remarkable even when 
the accuracy for predicting the untyped SNP using 
the genotyped SNPs is only of a moderate level (Sce- 
nario 2, where i? 2 = 0.39). On the other hand, the 
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prediction accuracy does affect the degree of recov- 
ery of the missing information that may be achieved 
by imputation: in Scenario 1, with perfect prediction 
of the untyped SNP, the tests using imputed geno- 
types do attain the full power we would obtain if the 
tests were based on the true genotype of the untyped 
SNP. In Scenario 2, with moderate prediction of the 
untyped SNP, imputation of the untyped SNP can 
recover partial but not full power. It is worth remem- 
bering that, with exact data, the retrospective-score 
test is usually more powerful than the prospective- 
score under the dominant or recessive model, and 
the two tests are essentially equivalent under the 
additive model. Here we observe the same phenom- 
ena when the prospective- and retrospective-score 
tests are based on imputed genotypes. 

As we noted earlier, when exact genotype data are 
available, the retrospective-score test is more sensi- 
tive to violation of the HWE assumption than the 
prospective-score test; that is, the former is usu- 
ally biased while the latter still remains unbiased 
when HWE does not hold. To assess the robustness 
properties for the prospective- and retrospective- 
score tests with imputed genotype data, we per- 
formed a further simulation study where the SNP 
haplotypes are still given as in Tables 3 and 4, but 
the haplotype pair H dl = (h a ,hb) for each person 
is given by the model with Pr{H dl = (h a ,hb)} = 
(1 - C)0 a b for h a ^ h b and Pr{H di = (h a ,h b )} = 
Q9 a + (1 — C)#a f° r = where 9 a is the fre- 
quency for haplotype h a , and £, the fixation in- 
dex quantifying the departure from HWE, is set to 
0.05. We can see from the results listed in Table 7 
that, with imputed genotype data, the prospective- 
score test, like its exact-data counterpart, still shows 
greater robustness in maintaining the type-I error 
rates than the retrospective-score test. In particu- 
lar, the retrospective-score test, based on the reces- 
sive or dominant model, may yield high type-I error 
rates under violation of HWE, no matter whether 
exact or imputed genotype data are used. Thus, 
an empirical-Bayes type shrinkage method that can 
adapt between prospective and retrospective meth- 
ods depending on bias-variance trade-off could be 
useful for analysis of both typed and untyped SNPs. 

We conclude this section with a discussion on the 
two types of association analyses recently developed 
for untyped SNPs: the full likelihood approach 
(Lin, Hu and Huang, 2008) and the two-stage ap- 
proach (Nicolae, 2006; Marchini et al., 2007). The 



Table 7 

Size (%) of the prospective- and retrospective-score tests 
(significance level — 0.01) based on the imputed and true (in 

parenthesis) genotypes at the untyped causal SNP, using 
SNP data generated according to Scenarios 1 (Table 3) and 

2 (Table 4) and a fixation index of 0.5 (violating HWE). 
Results are based on 3000 simulated data sets 





Prospective score 
imputed (true) 


Retrospective score 
imputed (true) 


Recessive model 






Scenario 1 


0.8 (0.8) 


1.7 (1.7) 


Scenario 2 


1.2 (1.2) 


5.9 (7.7) 


Dominant model 






Scenario 1 


0.9 (0.9) 


1.4 (1.4) 


Scenario 2 


1.0 (0.8) 


3.2 (5.1) 


Additive model 






Scenario 1 


1.0 (1.0) 


1.0 (1.0) 


Scenario 2 


0.7 (0.8) 


0.7 (0.8) 



full likelihood approach uses a retrospective likeli- 
hood for the case-control sample and a likelihood 
for the external (such as HapMap) data, by which 
the imputation and association analysis are simul- 
taneously performed in a one-stage manner. Con- 
versely, the two-stage approach performs the impu- 
tation and association analysis separately: imputing 
missing genotypes in the first stage and then per- 
forming association analysis in the second stage. In 
the imputation stage of the two-stage approach, one 
can apply existing powerful external imputation al- 
gorithms such as Nicolae (2006) and Marchini et al. 
(2007), and, hence, the two-stage approach is conve- 
nient to implement. There has been some debate on 
the efficiency difference between the two approaches 
(Marchini and Howie, 2008; Lin and Hu, 2008). Our 
simulation results (Tables 5 and 6) suggest that some 
of the efficiency difference between the full likeli- 
hood and the two-stage approaches may be due to 
the use of different likelihoods (prospective vs. ret- 
rospective) and not so much due to the use of one- 
stage vs. two-stage analysis. In this section we have 
shown that one can still use a retrospective likeli- 
hood even in a two-stage approach with powerful 
imputation performed at the first stage. 

4. HAPLOTYPES 

4.1 Definitions, Background and Missing Data 

Although single-SNP association tests are often 
the primary methods for genome-wide association 
scans, many secondary or "downstream" analyses 
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Table 5 

Size/Power (%) of the prospective- and retrospective-score tests (significance level= 0.01) based on the imputed and true (in 
parenthesis) genotypes at the untyped causal SNP, using SNP data generated according to Table 3 (perfect prediction). Also 
shown are results for the Hotelling's T 2 test based only on genotypes at the typed SNPs. Results for power (size) are based on 

1000 (3000) simulated data sets 



Prospective score Retrospective score Hotelling's T 2 
(3 imputed (true) imputed (true) 



Recessive model 















1.1 


(1.1) 


1.1 


(1.1) 


0.9 


0.5 


26.1 


(26.1) 


33.7 


(33.7) 


3.6 


0.6 


40.1 


(40.1) 


55.3 


(55.3) 


5.6 


Dominant model 















1.0 


(1.3) 


1.0 


(1.3) 


0.9 


0.3 


68.6 


(68.6) 


72.9 


(72.9) 


39.0 


0.4 


96.0 


(96.0) 


96.7 


(96.7) 


79.3 


Additive model 















1.2 


(1.2) 


1.2 


(1.2) 


0.9 


0.2 


43.0 


(43.0) 


43.0 


(43.0) 


24.2 


0.3 


86.4 


(86.4) 


86.4 


(86.4) 


65.5 



are often useful for detailed characterization of the 
risk of the disease associated with specific genomic 
regions of interest. One popular technique is haplotype- 
based association analysis, which involves studying 
the association of a disease with a genomic region in 
terms of the underlying "haplotypes," the combina- 
tion of alleles at multiple loci along individual ho- 
mologous chromosomes. Originally, haplotype-based 
association analysis was considered a powerful tech- 



nique for "indirect" association testing in situations 
where a causal SNP may not have been genotyped, 
but the haplotypes defined by multiple typed SNPs 
could serve as a good "surrogate" for the causal vari- 
ant. With the advent of various imputation meth- 
ods, although haplotype analysis has become less 
relevant for such indirect association testing, it re- 
mains a useful tool for parsimonious characteriza- 
tion of disease risk associated with multiple, possibly 



Table 6 

Size/Power (%) of the prospective- and retrospective-score tests (significance level — 0.01) based on the imputed and true (in 
parenthesis) genotypes at the untyped causal SNP, using SNP data generated according to Table 4 (moderate prediction). 
Also shown are results for the Hotelling's T 2 test based only on genotypes at the typed SNPs. Results for power (size) are 

based on 1000 (3000) simulated data sets 



Prospective score Retrospective score Hotelling's T 2 
[3 imputed (true) imputed (true) 



Recessive model 















1.4 


(1.2) 


1.2 


(1.2) 


1.1 


0.5 


42.6 


(92.2) 


47.0 


(97.6) 


17.6 


0.6 


59.4 


(99.1) 


66.4 


(99.9) 


24.9 


Dominant model 















0.8 


(1.1) 


0.9 


(1.0) 


1.1 


0.4 


48.5 


(95.6) 


54.3 


(98.2) 


23.8 


0.5 


71.6 


(99.6) 


77.2 


(100) 


41.5 


Additive model 















1.0 


(1.3) 


1.0 


(1.3) 


1.1 


0.3 


60.2 


(97.6) 


60.1 


(97.6) 


40.6 


0.4 


92.5 


(99.9) 


92.4 


(99.9) 


77.4 
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interacting, loci within a given region. Moreover, it 
is conceivable that for some regions, the haplotypes, 
and not the individual SNPs, are functional units 
and, thus, for these regions stronger signals of asso- 
ciations could be detected by performing haplotype- 
based regression analysis. 

A technical problem for haplotype-based regres- 
sion analysis is that typically the haplotype infor- 
mation for the study subjects is not directly ob- 
servable. Instead, locus-specific genotype data are 
observed, which contain information on the pair of 
alleles a subject carries, but does not provide the 
"phase information," that is, which combinations of 
alleles appear across multiple loci along the indi- 
vidual chromosomes. In general, the genotype data 
of a subject will be phase-ambiguous whenever the 
subject is heterozygous at two or more loci. Statis- 
tically, the lack of phase information can be viewed 
as a special missing data problem. 

For example, suppose A/a and B/b denote the 
major/minor alleles in two bi-allelic loci. A partic- 
ular haplotype pair, called a diplotype, is the pair 
of alleles that are inherited from one's parents. One 
such haplotype pair would be (AB) — (ab), and dis- 
ease risk can be associated with the number of copies 
of particular haplotypes that one inherits. Unfortu- 
nately, the diplotypes are not observable directly, 
but instead we can observe only the unordered or 
combined genotypes, in this case (Aa) at the first 
locus and (Bb) at the second locus, that is, (AaBb). 
However, when observing only the genotypes, the ac- 
tual haplotype pair is unknown, or "phase ambigu- 
ous," because the haplotype pair (Ab) — (aB) has the 
same set of unordered genotypes. Confronted with 
the unordered set of genotypes (AaBb), we know 
that the actual haplotype pair is either (AB) — (ab) 
or (Ab) — (aB), but we must use probability mod- 
els to take into account the phase ambiguity when 
performing statistical inference. 

In Section 2 we described "model-free" prospec- 
tive and "model-based" efficient retrospective meth- 
ods for analyzing SNP data, and we also described 
empirical-Bayes methods that data-adaptively move 
between the two. Just as in SNP data, for haplo- 
type data there are also model-free and model-based 
methods, and accompanying empirical-Bayes meth- 
ods. 

A variety of methods have been developed for 
haplotype-based analysis of case-control data using 
the logistic regression model (Zhao, Li and Khalid, 
2003; Lake et al., 2003; Epstein and Satten, 2003; 



Satten and Epstein, 2004; Spinka, Carroll and Chat- 
terjee, 2005; Lin and Zeng, 2006; Chatterjee et al., 
2006; Chen, Chatterjee and Carroll, 2009). Consider 
a general risk model similar to (3) but with the ad- 
dition of environmental factors (E) and written in 
terms of the diplotypes, denoted as H : 

pr(D = l\H di ,E) 

exp{a + m(H dl ,E,(3)} 
~ 1 + exp{a + m(H di , E, j3)} ' 

where the function m(-) is chosen in a suitable way 
to reflect an assumed mode of genetic effect. For ex- 
ample, suppose we are interested in the particular 
haplotype h* = (ab). A model that assumes an ad- 
ditive effect of this haplotype would have m(H dl = 
h dl ,E) linear in the number of copies of the haplo- 
type h*. 

4.2 Model-Based and Model-Free Methods 

4.2.1 Identifiability The data setup then is that 
we have observations on environmental exposure (E), 
genotypes G and cases and controls D. What is 
missing is the underlying diplotype H dl . The retro- 
spective likelihood is still (2), but the risk of disease 
depends on the diplotype H dl and not otherwise on 
the genotype. 

While models such as (11) seem straightforward 
enough for random samples, in retrospective sam- 
ples a problem arises because of the phase ambi- 
guity. In particular, all components of j3 may not 
be identifiable if the distribution of (H dl ,E) is left 
completely unrestricted (Epstein and Satten, 2003; 
Lin and Zeng, 2006). Thus, to make progress, some 
type of distributional assumptions are needed. Here 
we will distinguish between two approaches, both of 
them retrospective in nature but with different dis- 
tributional assumptions. The first we call "model- 
free" in that very little is actually assumed about 
the haplotype distribution. If haplotypes were ob- 
servable, this method reduces to ordinary prospec- 
tive logistic regression, while in the rare disease case 
with phase ambiguity, the method reduces to that of 
Zhao, Li and Khalid (2003). The second approach, 
which we call "model-based," makes much stronger 
assumptions about the haplotype distribution, and 
reduces to the efficient retrospective approach of 
Chatterjee and Carroll (2005) if haplotypes were ob- 
servable. The model-free method will thus be more 
robust but less efficient than the model-based method 
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4.2.2 Model-based method The model-based 
method (Spinka, Carroll and Chatterjee, 2005) has 
three aspects: 

(A.l) Haplotypes and the environment are assumed 

independent in the population. 
(A. 2) The diplotypes are assumed to be in HWE in 

the population, so that 

w (H di = h di = (h a ,h b )\E) 
= q{h di = (h a ,h b ),9} 



91, if h a = h b , 

26 a 6 b , ifh a ^h b , 



where 8 S denotes the population frequency for 
the haplotype h s . 
(A. 3) The distribution of the environmental vari- 
able E is left completely nonparametric. 

The methodology Spinka, Carroll and Chatterjee 
(2005) used to construct their profile likelihood was 
a nonparametric maximum likelihood estimator over 
the unknown distribution of E. However, there is an 
alternative derivation, one that is both more intu- 
itive and much easier to work out. Indeed, it is a not 
sufficiently well-known fact that for most purposes 
a case-control study can be viewed as a prospective 
study with missing data. Consider a sampling sce- 
nario where each subject from the underlying pop- 
ulation is selected into the case-control study us- 
ing a Bernoulli sampling scheme where the selection 
probability for a subject given his/her disease status 
D = d is proportional to Nj/ pr(D = d). Inference 
with the actual case-control data can then be based 
on the pseudo-likelihood derived for such an alter- 
native sampling scenario. Let 5 = 1 denote that a 
subject is selected in the case-control sample under 
this Bernoulli sampling scheme and hence has been 
observed. Then in this alternative sampling scheme, 
and with the assumptions stated above, 
Spinka, Carroll and Chatterjee (2005) compute 
pi(D = 1,G = g\E, 5 = 1). This calculation is simple 
and in the rare disease case the resulting efficient 
model-based likelihood function reduces to 

L mo dci(D, G, E, fi) 

= q(h di ,6)exp[D{K + m(h di ,E,P)}} 

h di £H G 



(12) 



■exp[s{K + m(h di ,E,p)}]j, 

where pd = N^/N, ir^ = pr(D = d), k = a + 
log(pi/jPo)-log(7ri/7r ), = (f3,6,K), and U G is the 
set of diplotypes consistent with the observed geno- 
types G. 

4.2.3 Model-free method The two important model 
assumptions in the model-based estimator are (A.l) 
and (A. 2). Although because of identifiability some 
model assumptions must be made, they can be weak- 
ened tremendously, as follows (Chen, Chatterjee and 
Carroll, 2009): 

(B.l) The haplotype and the environment are inde- 
pendent in the population given the genotype 
G. 

(B.2) There population distribution for the diplo- 
types given the genotype G, called qfr ee (h dl \G, 9), 
can be derived assuming HWE. 

Following the same alternative sampling scheme as 
described in Section 4.2.2, or by doing a nonpara- 
metric maximum likelihood analysis, we can com- 
pute pr(D = 1\G, E,5 = 1) under assumptions (B.l), 
(B.2) and (A.3) to be 

Lfree(D, G, E, O) 

= Yl Qfrcc(h di \G,e) 



h di £H G 



(13) 



■exp[D{n + m(h dl ,E,P)}} 
=0 h di eH G 



■exp[s{K + m(/i di , £,/?)}] J. 

To see why the likelihood Lf rcc requires far weaker 
assumptions than L mo delj note that Lf ree requires 
the haplotype-environment independence and HWE 
assumption only to specify the conditional distribu- 
tion pv(H dl \G, X), while -Lmodei requires the same 
assumption to specify the entire joint distribution 
pi(H dl , X). As a result, £f roc requires the haplotype- 
environment independence and HWE only to resolve 
the phase ambiguous genotypes. The likelihood con- 
tribution for the subjects with phase unambiguous 
genotypes, that is, G = H , is the same as that for 
the standard prospective logistic regression. In con- 
trast, Lmodei depends on the assumptions (A.l) and 
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(A. 2) irrespective of whether a subject has a missing 
phase or not. 

Note that Lf Tee (D,G,E,Q) will contain little in- 
formation on 9 since it conditions on G. Thus, when 
implementing methods based on this likelihood, 
Chen, Chatterjee and Carroll (2009) proposed to re- 
place the score function for 9 by the estimating func- 
tion for 9 based on the genotype data from the con- 
trols and assuming that the haplotypes are in HWE 
in the population. 

4.3 Empirical-Bayes 

In Section 4.2.2 we constructed a profile likeli- 
hood under strong assumptions leading to an effi- 
cient method that will not be robust to violations 
of the two major assumptions. Conversely, in Sec- 
tion 4.2.3 we computed a profile likelihood leading 
to much more robust inference, but at the cost of 
a steep loss of efficiency. Similarly to Section 2.3, 
here we briefly review a fully sample size- and data- 
adaptive empirical-Bayes method that 
Chen, Chatterjee and Carroll (2009) described for 
gaining^ efficiency when warranted but is still robust. 

Let /3f ree and /3 mo dei be the model-free and model- 
based estimates, with jth components f3jj ree and 

model- Let V be the covariance matrix of ip = 
/3f ree — /3 mo d c i, with the jth diagonal element of V be- 
ing Vj\ a sandwich estimator Vj can be computed, al- 
though a nonparametric bootstrap can also be used. 
Then one can define the empirical-Bayes estimator 

Pj,EB = /3j,free + Wj (/3j, mo del ~ Pj,free)', 



Vj (/^j,free model) 2 

The intuition behind (14) is that if the model fails, 

(Pj, model — /9j,free) will be large relative to Vj, which 
as a variance is proportional to iV _1 , hence, Wj ~ 0, 
and, hence, the empirical-Bayes method will effec- 
tively become the model-free estimator. If, however, 
the model assumption holds, then Vj and (/3j,f re e — 
Pj, model) 2 are proportional to one another, so that 
Wj > and the empirical-Bayes estimate goes part 
way toward the model-based estimator, and hence 
gains efficiency over the model-free estimate. 
Chen, Chatterjee and Carroll (2009) describe the 
limiting distribution of (14) and how to compute 
an estimate of its variance. 

Chen, Chatterjee and Carroll (2009) illustrate ap- 
plication of the different methods in two case-control 
data examples. The examples were chosen in such a 



way that from a priori biologic grounds one would 
expect the gene-environment independence assump- 
tion to hold in one case, but not in the other. The 
two examples together illustrate how the different 
shrinkage estimators adapt to alternative scenarios 
of gene-environment distribution. 

5. DISCUSSION 

Researchers now increasingly use the Cochran- 
Armitage trend test as the primary method for single- 
SNP association testing in the GWAS. The test is 
known to have robust power for the detection of ef- 
fect of susceptibility SNPs under a range of realistic 
modes of inheritance that give rise to some sort of 
monotone relationship between disease risk and al- 
lele count. As noted in Section 2, the retrospective 
and prospective methods have very similar, if not 
identical, power under the trend model and thus ei- 
ther could be used as the primary method for anal- 
ysis of GWAS data. The trend test, however, can 
perform very poorly for the detection of SNPs for 
which the minor allele has a recessive effect. Thus, 
it is often recommended that a test under the re- 
cessive mode of inheritance be conducted as a sec- 
ondary step to detect SNPs with recessive effects 
that may be missed by the primary trend test of as- 
sociation. The use of the retrospective method can 
be potentially beneficial at this stage. One, however, 
has to be cautious about creation of false positive re- 
sults due to the violation of the HWE assumption. 
We recommend that if a retrospective method is to 
be used for potential power gain, then it should be 
used in conjunction with the empirical-Bayes type 
shrinkage estimation. Our numerical investigations 
suggest that such a method can indeed be more pow- 
erful than the conventional "prospective" methods 
without creating excess false positives; see Tables 1 
and 2. 

In this article, although we focus on association 
tests involving bi-allelic SNPs, the same issues are 
relevant for genetic association tests involving loci 
with more than two alleles. In particular, one can 
gain efficiency for analysis of case-control data by 
assuming HWE or other natural population-genetic 
models (Satten and Epstein, 2004; Lin and Zeng, 
2006) to specify multi-allelic genotype frequency for 
the underlying population. The sensitivity of the 
methods to underlying model assumption can be 
reduced by appropriate shrinkage estimation tech- 
niques. 
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The impact of population stratification (PS) can 
be very different for prospective and retrospective 
methods. As it is well known, the presence of 
population stratification, that is, the existence of 
hidden ethnic sub-structures in the population, can 
create confounding bias in all of the methods when 
both gene-frequency and disease risks vary across 
the underlying strata. The presence of PS can also 
cause large scale violation of the HWE assumption, 
thus making the retrospective method more suscep- 
tible to bias than its prospective counterpart. Our 
application of different methods to the CGEMS 
genome-wide association study data illustrated that 
the empirical-Bayes type procedure can correct for 
inflated type-I error that may exist for retrospective 
methods due to large scale violation of the underly- 
ing HWE assumption. 

The difference between prospective and retrospec- 
tive methods becomes more relevant for studies of 
gene-gene and gene-environment interactions, a topic 
that we have not directly addressed in this article. In 
particular, retrospective methods, such as the case- 
only analysis (Piegorsch, Weinberg and Taylor, 1994), 
which assumes gene-gene or/and gene-environment 
independence for the underlying population, can gain 
dramatic power for testing and estimation of odds 
ratio interaction parameters in the logistic regres- 
sion model. Given that standard case-control anal- 
yses often have poor power for detection of multi- 
plicative interactions due to small numbers of cases 
or controls in cells of crossing exposures, practition- 
ers often find it is tempting to use the more power- 
ful retrospective methods. The assumption of gene- 
environment independence, however, can be violated, 
either due to direct casual association between gene 
and environment or indirect association due to ef- 
fects of family history and hidden population strati- 
fication. The assumption of gene-gene independence 
between physically distant genes can also be violated 
due to population stratification. Thus, we believe 
the development of shrinkage (Mukherjee and Chat- 
terjee, 2008; Chen, Chatterjee and Carroll, 2009) 
and other types of data-adaptive techniques 
(Li and Conti, 2009) has been valuable for robust 
inference in case-control studies of genetic epidemi- 
ology. 
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